Abstract

We consider the problems of ridge ($L_2$-regularized) and lasso ($L_1$-regularized) linear
regression in a partial-information setting, in which the learner is allowed to observe only
a fixed number of attributes of each example at training time. We present simple and
efficient algorithms for both problems that are optimal (up to logarithmic factors), in the
sense that they need to observe the same number of attributes as full-information
algorithms do. We thereby answer an open problem recently posed by Cesa-Bianchi et al.
(2010) and show their lower bound to be tight.

1 Introduction

In this note we consider the problem of linear regression in the partial information setting - i.e. 
where the learner is allowed to observe only a few attributes of each example, which we formally 
define next. 

Linear regression. In the linear regression problem, each instance is a pair $(\mathbf{x}, y)$ of an
attribute vector $\mathbf{x} \in \mathbb{R}^d$ and a target variable $y \in \mathbb{R}$. We assume the standard framework
of statistical learning, in which the pairs $(\mathbf{x}, y)$ follow a joint probability distribution
$\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$. The goal of the learner is to find a vector $\mathbf{w}$ for which the linear rule
$\hat{y} \leftarrow \mathbf{w}^T \mathbf{x}$ provides a good prediction of the target $y$. To measure the performance of the
prediction, we use a loss function $\ell(\hat{y}, y) : \mathbb{R}^2 \to \mathbb{R}$. Following Cesa-Bianchi et al. (2010),
we focus on the square loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$, which corresponds to the common
least-squares regression. Hence, in terms of the distribution $\mathcal{D}$, we would like to find a
predictor $\mathbf{w} \in \mathbb{R}^d$ with low expected loss, defined as

$$L_{\mathcal{D}}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[ \ell(\mathbf{w}^T \mathbf{x}, y) \right] . \tag{1}$$



The standard approach in this regard is seeking a vector $\mathbf{w} \in \mathbb{R}^d$ that minimizes a tradeoff between
the expected loss and an additional regularization term, which is usually a norm of $\mathbf{w}$. An equivalent
form of this optimization problem is obtained by replacing the regularization term with a proper
constraint, giving rise to the problem

$$\min_{\mathbf{w} \in \mathbb{R}^d} L_{\mathcal{D}}(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_p \le B , \tag{2}$$

where $B > 0$ is a regularization parameter and $p \ge 1$. The most important instances of this problem
are widely known as "ridge regression" ($p = 2$) and "lasso regression" ($p = 1$).

Since the distribution $\mathcal{D}$ is unknown, we approach problem (2) by exploiting a training set
$S = \{(\mathbf{x}_t, y_t)\}_{t=1}^{m}$ of examples that are assumed to be sampled independently from $\mathcal{D}$. Then, we
can estimate the expected loss (1) by computing the training loss over $S$,

$$L_S(\mathbf{w}) = \frac{1}{m} \sum_{t=1}^{m} \ell(\mathbf{w}^T \mathbf{x}_t, y_t) , \tag{3}$$

and instead of attacking (2) directly, we consider the constrained optimization problem

$$\min_{\mathbf{w} \in \mathbb{R}^d} L_S(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_p \le B . \tag{4}$$
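
As a minimal illustration of the training loss (3) and the norm constraint in (4), the following Python sketch evaluates both for a given predictor. The function names and the NumPy layout (one example per row of `X`) are assumptions made for this example and are not notation from the text.

```python
import numpy as np

def square_loss(y_hat, y):
    """Square loss 0.5 * (y_hat - y)^2, applied elementwise."""
    return 0.5 * (y_hat - y) ** 2

def training_loss(w, X, y):
    """Average square loss of the linear rule x -> w.x over the rows of X."""
    return float(np.mean(square_loss(X @ w, y)))

def is_feasible(w, B, p):
    """Check the norm constraint ||w||_p <= B of the constrained problem."""
    return bool(np.linalg.norm(w, ord=p) <= B)
```

With `p=2` the feasibility check corresponds to the ridge constraint, and with `p=1` to the lasso constraint.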



We distinguish between two learning scenarios. In the full information setup, the learner has
unrestricted access to the entire data set. In the partial information setting, for any given
example pair $(\mathbf{x}, y)$, the learner can observe $y$, but only $k$ attributes of $\mathbf{x}$ (where $k$ is a parameter of
the problem). The learner can choose which attributes to observe.

Results. In this work, we provide optimal algorithms (up to logarithmic factors) for ridge and
lasso regression in the partial information setting. The algorithms are optimal in the sense that they
require on the order of $\tilde{O}(d/k)$ examples to reach a certain fixed accuracy, whereas the lower bound of
Cesa-Bianchi et al. (2010) implies that $\Omega(d/k)$ examples are needed in general. That is, our upper
bounds match this lower bound up to logarithmic factors (and constants). More specifically, for our
ridge regression algorithm, we prove the upper bound

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le \min_{\|\mathbf{w}^*\|_2 \le B} L_{\mathcal{D}}(\mathbf{w}^*) + O\left( B^2 \sqrt{\frac{d/k}{m}} \right) ,$$

while for our lasso regression algorithm we establish the bound

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le \min_{\|\mathbf{w}^*\|_1 \le B} L_{\mathcal{D}}(\mathbf{w}^*) + O\left( B^2 \sqrt{\frac{(d/k) \log d}{m}} \right) .$$

Here we use $\bar{\mathbf{w}}$ to denote the output of each algorithm on a training set of $m$ examples (when
configured properly), and the expectations are taken with respect to the randomization of the
algorithms. In particular, setting $k = d$ we obtain the known bounds for full information ridge and lasso
regression (see e.g. Kakade et al. (2008)).

The bounds imply that our algorithms require $O(d/k\epsilon^2)$ examples in order to reach an accuracy of $\epsilon$,
i.e. they need the same number of attributes as their full information counterparts. The algorithms
themselves are very simple to implement, and require only $O(1)$ processing time for each observed
attribute (hence the overall runtime of both is $O(d/k\epsilon^2)$).

Previous work. In this note we continue the investigation of Cesa-Bianchi et al. (2010) into
attribute efficient learning. The reader is referred to their manuscript for more detailed background
and historical references. Of particular interest is the following lower bound on the sample
complexity needed to obtain a certain accuracy:

Theorem 1 (Cesa-Bianchi et al. (2010), Theorem 3). For any $0 < \epsilon < \frac{1}{16}$, $k \ge 1$ and $d \ge 4k$, there
exist a distribution over examples $\mathcal{D}$ and a weight vector $\mathbf{w}^*$, with $\|\mathbf{w}^*\|_0 = 1$, $\|\mathbf{w}^*\|_1 = \|\mathbf{w}^*\|_2 = 2\sqrt{\epsilon}$,
such that any regression algorithm accessing at most $k$ attributes per training example must see (in
expectation) at least $\Omega(d/k)$ examples in order to learn a linear predictor $\mathbf{w}$ with $L_{\mathcal{D}}(\mathbf{w}) - L_{\mathcal{D}}(\mathbf{w}^*) < \epsilon$.

This lower bound complements our upper bounds, and shows our bounds to be tight to within
poly-logarithmic factors.



2 Bandit online convex optimization 

We begin by presenting the primary building blocks of our regression algorithms. The techniques 
we describe can be classified as bandit (i.e. partial feedback) algorithms for the Online Convex 
Optimization problem, which we describe next. 

The online convex optimization problem of Zinkevich (2003) is defined as the following repeated
game between a learner and the environment. At each time step $t = 1, \ldots, T$, the learner chooses
a vector $\mathbf{w}_t$ from a convex set $W \subseteq \mathbb{R}^d$, which we refer to as the decision set. Subsequently, after
observing the choice $\mathbf{w}_t$, the environment reveals a convex cost function $c_t : W \to \mathbb{R}$ and the learner
incurs the cost $c_t(\mathbf{w}_t)$. The goal of the learner is to minimize his regret, defined as

$$R_T = \sum_{t=1}^{T} c_t(\mathbf{w}_t) - \min_{\mathbf{w}^* \in W} \sum_{t=1}^{T} c_t(\mathbf{w}^*) .$$

A non-trivial strategy should yield the learner a regret that is sublinear with respect to the time
horizon $T$. Indeed, Zinkevich (2003) shows that a regret of $O(\sqrt{T})$ is achievable, even when the
environment is adversarial. His algorithm relies on the gradient $\nabla_t = \nabla c_t(\mathbf{w}_t)$ for making the
learner's decision (after time $t$). That is, it is enough for the learner to observe only $\nabla_t$ (instead of
the entire function $c_t$) in order to establish his strategy.

In this work, we focus on gradient-based algorithms that only observe an unbiased estimate $\tilde{\nabla}_t$ of
the gradient $\nabla_t$, instead of the entire $c_t$. That is, at time step $t$ the learner observes a vector $\tilde{\nabla}_t$ for
which $\mathbb{E}_t[\tilde{\nabla}_t] = \nabla_t$, and needs to base his decision on $\tilde{\nabla}_t$ exclusively. We use the notation
$\mathbb{E}_t[\cdot]$ to denote the expectation conditioned on all randomness in the first $t - 1$ time
steps.

2.1 Bandit Gradient Descent 

We first present the algorithm of Zinkevich (2003), extended to our setting in which at each time step
the learner observes $\tilde{\nabla}_t$ instead of the true gradient $\nabla_t$ (an extension originally due to Flaxman et al. (2005)).
The only assumption we shall require on the estimates $\tilde{\nabla}_t$ is that the norms $\|\tilde{\nabla}_t\|_2$ are bounded
in expectation, i.e. that there exists some $G > 0$ with $\mathbb{E}_t[\|\tilde{\nabla}_t\|_2^2] \le G^2$ for all $t$. In other words,
we assume the estimates $\tilde{\nabla}_t$ are of "bounded variance", where this variance is measured by the $L_2$
norm. In particular, each estimate can be weakly bounded, or even unbounded, without affecting
the performance of the algorithm as long as the latter variance is bounded.

The Bandit Gradient Descent (BGD) algorithm is given as Algorithm 1. For our needs, it is enough
to state it in its simplest form, where the decision set is the Euclidean ball $\mathcal{B}_2(B) = \{\mathbf{w} \in \mathbb{R}^d :
\|\mathbf{w}\|_2 \le B\}$.



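Algorithm 1 Bandit Gradient Descent (over an L2-ball)

1: Input: $\eta > 0$, $B > 0$.
2: Let $\mathbf{w}_1 \leftarrow \mathbf{0}_d$
3: for $t = 1$ to $T$ do
4:   Predict $\mathbf{w}_t$
5:   Observe an unbiased estimate $\tilde{\nabla}_t$ for the gradient $\nabla_t = \nabla c_t(\mathbf{w}_t)$
6:   Update: $\mathbf{z}_{t+1} \leftarrow \mathbf{w}_t - \eta \tilde{\nabla}_t$, $\quad \mathbf{w}_{t+1} \leftarrow \dfrac{B \, \mathbf{z}_{t+1}}{\max\{\|\mathbf{z}_{t+1}\|_2, B\}}$
7: end for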
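
The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch: the callback `grad_estimate(t, w)`, which stands in for whatever mechanism supplies the unbiased gradient estimate, is a hypothetical interface introduced here and does not appear in the text.

```python
import numpy as np

def project_l2_ball(z, B):
    """Euclidean projection onto {w : ||w||_2 <= B}, as in the update step."""
    return B * z / max(np.linalg.norm(z), B)

def bandit_gradient_descent(grad_estimate, d, T, eta, B):
    """Run T rounds of the update; grad_estimate(t, w) is a hypothetical
    callback returning an unbiased estimate of the gradient of c_t at w.
    Returns the list of predictions w_1, ..., w_T."""
    w = np.zeros(d)
    predictions = []
    for t in range(T):
        predictions.append(w.copy())
        z = w - eta * grad_estimate(t, w)   # gradient step
        w = project_l2_ball(z, B)           # projection back onto the ball
    return predictions
```

For instance, with the exact gradient of $\frac{1}{2}\|\mathbf{w} - \mathbf{c}\|_2^2$ supplied as the estimate and $\|\mathbf{c}\|_2 > B$, the iterates settle at the boundary point $B\mathbf{c}/\|\mathbf{c}\|_2$.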



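In the following lemma we provide a bound on the expected regret of Algorithm 1. It is proved in
the Appendix for completeness.

Lemma 2. Assume that $\mathbb{E}_t[\|\tilde{\nabla}_t\|_2^2] \le G^2$ for all $t \in [T]$. Then, for $\eta = \frac{B}{G\sqrt{T}}$, and for any fixed
$\mathbf{w}^* \in \mathcal{B}_2(B)$, we have

$$\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\mathbf{w}_t) \right] \le \sum_{t=1}^{T} c_t(\mathbf{w}^*) + BG\sqrt{T} .$$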

2.2 Bandit Exponentiated Gradient 

We now turn to consider a different algorithm, suitable for a situation in which the observed gradient
estimates $\tilde{\nabla}_t$ have $\|\mathbb{E}_t[\tilde{\nabla}_t^2]\|_\infty \le G^2$ (where the square is taken entrywise). That is, when we have a
bound over the variance of each individual entry of $\tilde{\nabla}_t$. In this case, Algorithm 1 performs poorly
(in terms of expected regret) since $\mathbb{E}_t[\|\tilde{\nabla}_t\|_2^2]$ often has an implicit dependence on the dimension $d$.

The algorithm we provide is a stochastic version of the Exponentiated Gradient (EG) algorithm
of Kivinen and Warmuth (1997) that employs multiplicative updates, with an additional step that
involves a "clipping" of $\tilde{\nabla}_t$. In this regard, we use the notation

$$\mathrm{clip}(x, c) = \max(\min(x, c), -c)$$

for scalar values $x \in \mathbb{R}$ and $c > 0$. This clipping operation prevents the updates from being "too
large", which is crucial for the stability of a multiplicative algorithm.

For clarity, we first describe the algorithm in the simpler case in which the decision set $W$ is the
unit simplex, which we denote by $\Delta_d$. Later, we show how the algorithm can be extended to work
over an $L_1$-ball.

2.2.1 Online optimization over the simplex 

The Bandit Exponentiated Gradient (BEG) algorithm for the case of optimization over the simplex
is given in Algorithm 2.

In the following lemma we establish a simple bound on the expected regret of Algorithm 2.

Algorithm 2 Bandit Exponentiated Gradient over the simplex

1: Input: $\eta > 0$.
2: Let $\mathbf{z}_1 \leftarrow \mathbf{1}_d$
3: for $t = 1$ to $T$ do
4:   Predict $\mathbf{w}_t$, where $\mathbf{w}_t \leftarrow \dfrac{\mathbf{z}_t}{\|\mathbf{z}_t\|_1}$
5:   Observe an unbiased estimate $\tilde{\nabla}_t$ for the gradient $\nabla_t = \nabla c_t(\mathbf{w}_t)$
6:   Let: $\bar{\nabla}_t(i) \leftarrow \mathrm{clip}(\tilde{\nabla}_t(i), 1/\eta)$ for all $i \in [d]$
7:   Update: $\mathbf{z}_{t+1}(i) \leftarrow \mathbf{z}_t(i) \cdot \exp(-\eta \bar{\nabla}_t(i))$ for all $i \in [d]$
8: end for
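
A compact Python rendering of Algorithm 2 is sketched below, under the same assumption as before that a hypothetical callback `grad_estimate(t, w)` supplies the unbiased gradient estimate. The entrywise clipping is done with `np.clip`, which matches the definition of clip(x, c) when called with the bounds `-1/eta` and `1/eta`.

```python
import numpy as np

def bandit_exponentiated_gradient(grad_estimate, d, T, eta):
    """BEG over the unit simplex. grad_estimate(t, w) is a hypothetical
    callback returning an unbiased gradient estimate at w.
    Returns the list of predictions w_1, ..., w_T."""
    z = np.ones(d)
    predictions = []
    for t in range(T):
        w = z / z.sum()                       # z stays positive, so sum = L1 norm
        predictions.append(w)
        g = grad_estimate(t, w)
        g_bar = np.clip(g, -1.0 / eta, 1.0 / eta)   # entrywise clipping at 1/eta
        z = z * np.exp(-eta * g_bar)          # multiplicative update
    return predictions
```

On a fixed linear cost, the weights concentrate on the coordinate with the smallest cost coefficient, which is the minimizer over the simplex.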



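Lemma 3. Assume that $\|\mathbb{E}_t[\tilde{\nabla}_t]\|_\infty \le G$ and $\|\mathbb{E}_t[\tilde{\nabla}_t^2]\|_\infty \le G^2$ for all $t \in [T]$, and that $T \ge \log d$.
Then, for $\eta = \frac{1}{G} \sqrt{\frac{\log d}{5T}}$ and for any fixed $\mathbf{w}^* \in \Delta_d$, we have

$$\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\mathbf{w}_t) \right] \le \sum_{t=1}^{T} c_t(\mathbf{w}^*) + G\sqrt{5T \log d} .$$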



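Proof: First, assume that $\eta \le 1/2G$. By Lemma 8 (in the appendix) we have

$$|\mathbb{E}_t[\bar{\nabla}_t(i)] - \nabla_t(i)| \le 2\eta \, \mathbb{E}_t[\tilde{\nabla}_t(i)^2] \le 2\eta G^2$$

for all $i \in [d]$, so that $\|\mathbb{E}_t[\bar{\nabla}_t] - \nabla_t\|_\infty \le 2\eta G^2$. Thus, since $\|\mathbf{w}_t - \mathbf{w}^*\|_1 \le 2$,

$$\begin{aligned}
\nabla_t^T (\mathbf{w}_t - \mathbf{w}^*) &= \mathbb{E}_t[\bar{\nabla}_t]^T (\mathbf{w}_t - \mathbf{w}^*) + (\nabla_t - \mathbb{E}_t[\bar{\nabla}_t])^T (\mathbf{w}_t - \mathbf{w}^*) \\
&\le \mathbb{E}_t[\bar{\nabla}_t]^T (\mathbf{w}_t - \mathbf{w}^*) + \|\mathbb{E}_t[\bar{\nabla}_t] - \nabla_t\|_\infty \|\mathbf{w}_t - \mathbf{w}^*\|_1 \\
&\le \mathbb{E}_t[\bar{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] + 4\eta G^2
\end{aligned}$$

and by taking the expectation of both sides we obtain

$$\mathbb{E}[\nabla_t^T (\mathbf{w}_t - \mathbf{w}^*)] \le \mathbb{E}[\bar{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] + 4\eta G^2 . \tag{5}$$

On the other hand, applying Lemma 7 on the vectors $\bar{\nabla}_1, \ldots, \bar{\nabla}_T$, we have

$$\begin{aligned}
\sum_{t \in [T]} \mathbf{w}_t^T \bar{\nabla}_t &\le \min_{i \in [d]} \sum_{t \in [T]} \bar{\nabla}_t(i) + \frac{\log d}{\eta} + \eta \sum_{t \in [T]} \mathbf{w}_t^T \bar{\nabla}_t^2 \\
&\le \sum_{t \in [T]} \mathbf{w}^{*T} \bar{\nabla}_t + \frac{\log d}{\eta} + \eta \sum_{t \in [T]} \mathbf{w}_t^T \bar{\nabla}_t^2 .
\end{aligned}$$

Taking the conditional expectation gives

$$\begin{aligned}
\sum_{t \in [T]} \mathbf{w}_t^T \mathbb{E}_t[\bar{\nabla}_t] &\le \sum_{t \in [T]} \mathbf{w}^{*T} \mathbb{E}_t[\bar{\nabla}_t] + \frac{\log d}{\eta} + \eta \sum_{t \in [T]} \mathbf{w}_t^T \mathbb{E}_t[\bar{\nabla}_t^2] \\
&\le \sum_{t \in [T]} \mathbf{w}^{*T} \mathbb{E}_t[\bar{\nabla}_t] + \frac{\log d}{\eta} + \eta G^2 T
\end{aligned}$$

and by rearranging and taking expectations we obtain that

$$\sum_{t \in [T]} \mathbb{E}[\bar{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] \le \frac{\log d}{\eta} + \eta G^2 T . \tag{6}$$

Algorithm 3 Bandit Exponentiated Gradient over an L1-ball

1: Input: $\eta > 0$, $B > 0$.
2: Let $\mathbf{z}_1^+ \leftarrow \mathbf{1}_d$, $\mathbf{z}_1^- \leftarrow \mathbf{1}_d$
3: for $t = 1$ to $T$ do
4:   Predict $\mathbf{w}_t$, where $\mathbf{w}_t \leftarrow B \cdot \dfrac{\mathbf{z}_t^+ - \mathbf{z}_t^-}{\|\mathbf{z}_t^+\|_1 + \|\mathbf{z}_t^-\|_1}$
5:   Observe an unbiased estimate $\tilde{\nabla}_t$ for the gradient $\nabla_t = \nabla c_t(\mathbf{w}_t)$
6:   Let: $\bar{\nabla}_t(i) \leftarrow \mathrm{clip}(\tilde{\nabla}_t(i), 1/\eta)$ for all $i \in [d]$
7:   Update:
       $\mathbf{z}_{t+1}^+(i) \leftarrow \mathbf{z}_t^+(i) \cdot \exp(-\eta \bar{\nabla}_t(i))$ for all $i \in [d]$
       $\mathbf{z}_{t+1}^-(i) \leftarrow \mathbf{z}_t^-(i) \cdot \exp(+\eta \bar{\nabla}_t(i))$ for all $i \in [d]$
8: end for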



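Putting (5) and (6) together, we get

$$\begin{aligned}
\sum_{t \in [T]} \mathbb{E}[c_t(\mathbf{w}_t) - c_t(\mathbf{w}^*)] &\le \sum_{t \in [T]} \mathbb{E}[\nabla_t^T (\mathbf{w}_t - \mathbf{w}^*)] \\
&\le \sum_{t \in [T]} \mathbb{E}[\bar{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] + 4\eta G^2 T \\
&\le 5\eta G^2 T + \frac{\log d}{\eta} .
\end{aligned}$$

Finally, setting $\eta = \frac{1}{G} \sqrt{\frac{\log d}{5T}}$ we obtain the desired bound. Note that for this choice of $\eta$ we have
$\eta \le 1/2G$ (since $T \ge \log d$), as was initially assumed. ■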

2.2.2 Online optimization over the Li-ball 

We now describe how Algorithm 2 can be leveraged to an online optimization algorithm over the
$L_1$-ball of radius $B$, denoted by $\mathcal{B}_1(B) = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\|_1 \le B\}$. This is accomplished by utilizing
the following mapping from the $2d$-dimensional simplex $\Delta_{2d}$ onto the $d$-dimensional ball $\mathcal{B}_1(B)$:

$$\tilde{\mathbf{w}} \mapsto B(\tilde{\mathbf{w}}^+ - \tilde{\mathbf{w}}^-) ,$$

where here and henceforth $\tilde{\mathbf{w}}^+, \tilde{\mathbf{w}}^- \in \mathbb{R}^d$ denote the vectors for which $\tilde{\mathbf{w}} = (\tilde{\mathbf{w}}^+, \tilde{\mathbf{w}}^-)$, for any
$\tilde{\mathbf{w}} \in \mathbb{R}^{2d}$. This gives rise to the algorithm given in Algorithm 3. A regret bound similar to that of
Algorithm 2 is derived in the following lemma.

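Lemma 4. Assume that $\|\mathbb{E}_t[\tilde{\nabla}_t]\|_\infty \le G$ and $\|\mathbb{E}_t[\tilde{\nabla}_t^2]\|_\infty \le G^2$ for all $t \in [T]$, and that $T \ge \log d$.
Then, for $\eta = \frac{1}{G} \sqrt{\frac{\log 2d}{5T}}$ and for any fixed $\mathbf{w}^* \in \mathcal{B}_1(B)$, we have

$$\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\mathbf{w}_t) \right] \le \sum_{t=1}^{T} c_t(\mathbf{w}^*) + BG\sqrt{5T \log 2d} .$$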

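Proof: The main observation of the proof is that the updates of the above algorithm are equivalent
to those of a BEG algorithm over the $2d$-dimensional simplex $\Delta_{2d}$, on the cost functions

$$\tilde{c}_t(\tilde{\mathbf{w}}) = c_t(B\tilde{\mathbf{w}}^+ - B\tilde{\mathbf{w}}^-) .$$

The predictions of the algorithm can be expressed as $\mathbf{w}_t = B(\tilde{\mathbf{w}}_t^+ - \tilde{\mathbf{w}}_t^-)$, where $\tilde{\mathbf{w}}_t \in \Delta_{2d}$ is the
prediction of the BEG algorithm at iteration $t$. Note that

$$\|\mathbf{w}_t\|_1 \le B(\|\tilde{\mathbf{w}}_t^+\|_1 + \|\tilde{\mathbf{w}}_t^-\|_1) = B\|\tilde{\mathbf{w}}_t\|_1 = B$$

so that $\mathbf{w}_t \in \mathcal{B}_1(B)$, as required. It remains to bound the expected regret of this instance of BEG, by
means of Lemma 3. At iteration $t$, the unbiased estimator $\tilde{\mathbf{g}}_t = B(\tilde{\nabla}_t, -\tilde{\nabla}_t)$ is used for estimating
the gradient $\nabla \tilde{c}_t(\tilde{\mathbf{w}}_t) = B(\nabla c_t(\mathbf{w}_t), -\nabla c_t(\mathbf{w}_t))$. It is easy to verify that $\|\mathbb{E}_t[\tilde{\mathbf{g}}_t]\|_\infty \le BG$ and
$\|\mathbb{E}_t[\tilde{\mathbf{g}}_t^2]\|_\infty \le (BG)^2$, so together with Lemma 3 this implies that

$$\begin{aligned}
\mathbb{E}\left[ \sum_{t \in [T]} c_t(\mathbf{w}_t) \right] &\le \min_{\tilde{\mathbf{w}} \in \Delta_{2d}} \sum_{t \in [T]} \tilde{c}_t(\tilde{\mathbf{w}}) + BG\sqrt{5T \log 2d} \\
&= \min_{\|\mathbf{w}\|_1 \le B} \sum_{t \in [T]} c_t(\mathbf{w}) + BG\sqrt{5T \log 2d} \\
&\le \sum_{t=1}^{T} c_t(\mathbf{w}^*) + BG\sqrt{5T \log 2d} ,
\end{aligned}$$

where we set $\eta = \frac{1}{G} \sqrt{\frac{\log 2d}{5T}}$.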
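
The reduction from the $L_1$-ball to the $2d$-dimensional simplex, as used in Algorithm 3, can be sketched in Python as follows. As in the earlier sketches, `grad_estimate(t, w)` is a hypothetical callback (not part of the text) that returns an unbiased estimate of the gradient at the current prediction.

```python
import numpy as np

def beg_l1_ball(grad_estimate, d, T, eta, B):
    """BEG over the L1-ball of radius B, keeping one positive weight vector
    for each sign. Returns the list of predictions w_1, ..., w_T."""
    z_pos, z_neg = np.ones(d), np.ones(d)
    predictions = []
    for t in range(T):
        # map the normalized 2d weights to a point of the L1-ball
        w = B * (z_pos - z_neg) / (z_pos.sum() + z_neg.sum())
        predictions.append(w)
        g_bar = np.clip(grad_estimate(t, w), -1.0 / eta, 1.0 / eta)
        z_pos = z_pos * np.exp(-eta * g_bar)   # positive part moves against g
        z_neg = z_neg * np.exp(+eta * g_bar)   # negative part moves with g
    return predictions
```

Every prediction has $L_1$ norm at most $B$ by construction, and the first prediction is the origin since the two weight vectors start out equal.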



3 Algorithms for attribute-efficient regression 

In this section we present and analyze our partial-information algorithms for the ridge and lasso 
regression problems. We first describe algorithms that only view 2 attributes of each training 
example. Later, we note how these algorithms can be used in the setting in which we are allowed to 
observe at most 2k attributes of each example. 

3.1 Ridge regression 

In the ridge regression problem, we are interested in the linear predictor that is the solution to the
optimization problem (4) with $p = 2$, that can be written explicitly as

$$\min_{\|\mathbf{w}\|_2 \le B} \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}^T \mathbf{x}_t - y_t)^2 . \tag{7}$$

Based on this form, it is straightforward to cast the problem as an online convex optimization
problem over the $L_2$-ball $\{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\|_2 \le B\}$, with cost functions

$$c_t(\mathbf{w}) = \frac{1}{2} (\mathbf{w}^T \mathbf{x}_t - y_t)^2 . \tag{8}$$

In the full information case, this problem can be solved efficiently via a simple Bandit Gradient
Descent algorithm. For applying this algorithm, we should calculate the gradient $\nabla_t = \nabla c_t(\mathbf{w}_t)$ at
iteration $t$, which in our case is given explicitly as

$$\nabla_t = (\mathbf{w}_t^T \mathbf{x}_t - y_t) \cdot \mathbf{x}_t . \tag{9}$$

In the partial information case, we are not able to use (9) since the vectors $\mathbf{x}_t$ are not fully observed.
Instead, we employ Algorithm 1 with an estimation of the gradient $\nabla_t$. We may estimate this
gradient relying on merely 2 attributes of $\mathbf{x}_t$: one is used for estimating the vector $\mathbf{x}_t$, while
the other is used for estimating the dot product $\mathbf{w}_t^T \mathbf{x}_t$. For estimating $\mathbf{x}_t$, we pick $i_t \in [d]$ according to
the uniform distribution over $[d]$, observe the attribute $\mathbf{x}_t(i_t)$ and define the random vector
$\tilde{\mathbf{x}}_t = d \, \mathbf{x}_t(i_t) \, \mathbf{e}_{i_t}$, where $\mathbf{e}_i$ is used to denote the $i$th element of the standard basis of $\mathbb{R}^d$. It is easy to verify
that $\tilde{\mathbf{x}}_t$ is an unbiased estimator of $\mathbf{x}_t$, i.e. $\mathbb{E}_t[\tilde{\mathbf{x}}_t] = \mathbf{x}_t$. As for the estimation of $\mathbf{w}_t^T \mathbf{x}_t$, we pick
$j_t \in [d]$ according to the distribution $\Pr[j] = \mathbf{w}_t(j)^2 / \|\mathbf{w}_t\|_2^2$, observe the attribute $\mathbf{x}_t(j_t)$ and define
the random variable $\tilde{y}_t = \mathbf{x}_t(j_t) \|\mathbf{w}_t\|_2^2 / \mathbf{w}_t(j_t)$. This serves as an unbiased estimator of $\mathbf{w}_t^T \mathbf{x}_t$, as
$\mathbb{E}_t[\tilde{y}_t] = \mathbf{w}_t^T \mathbf{x}_t$. Hence, since $\tilde{\mathbf{x}}_t$ and $\tilde{y}_t$ are independent, we obtain the following unbiased estimator
for the gradient $\nabla_t$ at iteration $t$:

$$\tilde{\nabla}_t = (\tilde{y}_t - y_t) \cdot \tilde{\mathbf{x}}_t . \tag{10}$$

An important feature of the specific sampling procedure used here, as the proof of Theorem 5 shows,
is our ability to provide a sharp bound over the "variance" term $\mathbb{E}_t[\|\tilde{\nabla}_t\|_2^2]$. As discussed in Section 2,
such a bound entails an improved regret guarantee for the underlying BGD algorithm.
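
The unbiasedness of the two estimators described above can be checked exactly by enumerating every possible index draw rather than sampling. The Python sketch below does this; the function name is chosen for this example, and it assumes that every entry of `w` is nonzero so that each index has positive probability under the second distribution.

```python
import numpy as np

def expected_ridge_estimates(w, x):
    """Exact expectations of the estimators of x and of w.x, obtained by
    enumerating all index draws. Assumes every entry of w is nonzero."""
    d = len(x)
    basis = np.eye(d)
    # estimator of x: d * x[i] * e_i, with i uniform over the d coordinates
    exp_x = sum((1.0 / d) * d * x[i] * basis[i] for i in range(d))
    # estimator of w.x: x[j] * ||w||^2 / w[j], with Pr[j] = w[j]^2 / ||w||^2
    sq = float(w @ w)
    exp_y = sum((w[j] ** 2 / sq) * (x[j] * sq / w[j]) for j in range(d))
    return exp_x, float(exp_y)
```

Each term of the second sum simplifies to `w[j] * x[j]`, which is exactly why the importance weights $\|\mathbf{w}_t\|_2^2 / \mathbf{w}_t(j)$ make the estimate unbiased.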

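The resulting algorithm is given as Algorithm 4. Note that only a single entry of the unprojected
iterate $\mathbf{v}_t$ (at index $i_t$) is updated on each iteration. In the following theorem we establish a convergence
guarantee for this algorithm.

Algorithm 4 Attribute-efficient ridge regression

1: Input: training set $S = \{(\mathbf{x}_t, y_t)\}_{t \in [m]}$, regularization parameter $B > 0$, learning rate $\eta > 0$
2: Let $\mathbf{v}_1 \leftarrow \mathbf{0}_d$
3: for $t = 1$ to $m$ do
4:   Let: $\mathbf{w}_t \leftarrow \dfrac{B \, \mathbf{v}_t}{\max\{\|\mathbf{v}_t\|_2, B\}}$
5:   Choose $i_t \in [d]$ with probability $1/d$, and observe $\mathbf{x}_t(i_t)$
6:   Choose $j_t \in [d]$ with probability $\mathbf{w}_t(j)^2 / \|\mathbf{w}_t\|_2^2$, and observe $\mathbf{x}_t(j_t)$
7:   Let: $\tilde{g}_t \leftarrow [\mathbf{x}_t(j_t) \|\mathbf{w}_t\|_2^2 / \mathbf{w}_t(j_t) - y_t] \cdot d \, \mathbf{x}_t(i_t)$
8:   Update: $\mathbf{v}_{t+1} \leftarrow \mathbf{v}_t - \eta \, \tilde{g}_t \, \mathbf{e}_{i_t}$
9: end for
10: return $\bar{\mathbf{w}} = \frac{1}{m} \sum_{t=1}^{m} \mathbf{w}_t$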
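
A possible Python sketch of Algorithm 4 is given below. It reads only two entries of each training example. Two choices in it are assumptions of this sketch rather than details taken from the text: the random number generator is passed in explicitly as `rng`, and when the current predictor is the zero vector (as in the first iteration) the estimate of the dot product is set to 0, which is its exact value in that case.

```python
import numpy as np

def attribute_efficient_ridge(X, y, B, eta, rng):
    """Sketch of the attribute-efficient ridge procedure. X has one example
    per row; only X[t, i] and X[t, j] are read at iteration t."""
    m, d = X.shape
    v = np.zeros(d)        # unprojected iterate
    w_sum = np.zeros(d)
    for t in range(m):
        w = B * v / max(np.linalg.norm(v), B)   # projection onto the L2-ball
        w_sum += w
        i = rng.integers(d)                     # uniform index for estimating x
        sq = float(w @ w)
        if sq > 0.0:
            j = rng.choice(d, p=w ** 2 / sq)    # Pr[j] = w[j]^2 / ||w||^2
            y_tilde = X[t, j] * sq / w[j]
        else:
            y_tilde = 0.0                       # assumed convention when w = 0
        g = (y_tilde - y[t]) * d * X[t, i]
        v[i] -= eta * g                         # single-coordinate update
    return w_sum / m
```

The returned average lies in the ball of radius $B$, since it is a convex combination of projected iterates.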



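Theorem 5. Assume the distribution $\mathcal{D}$ is such that $\|\mathbf{x}\|_2 \le 1$ and $|y| \le 1$ with probability 1. Let
$\bar{\mathbf{w}}$ be the output of Algorithm 4, when run with $\eta = \frac{1}{2\sqrt{dm}}$. Then, for any fixed $\mathbf{w}^* \in \mathbb{R}^d$ with
$\|\mathbf{w}^*\|_2 \le B$ it holds that

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{d}{m}} .$$

Proof: It is easy to verify that $\mathbb{E}_t[\|\tilde{\mathbf{x}}_t\|_2^2] \le d$ and $\mathbb{E}_t[\tilde{y}_t^2] \le B^2$, so that $\mathbb{E}_t[\|\tilde{\nabla}_t\|_2^2] \le 4B^2 d$. Lemma 2
(with $G = 2B\sqrt{d}$) now implies that for $\eta = \frac{1}{2\sqrt{dm}}$,

$$\mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}_t^T \mathbf{x}_t - y_t)^2 \right] \le \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}^{*T} \mathbf{x}_t - y_t)^2 + 2B^2 \sqrt{\frac{d}{m}} .$$

Taking the expectation of both sides with respect to the random choice of the training set, we get

$$\mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} L_{\mathcal{D}}(\mathbf{w}_t) \right] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{d}{m}} .$$

Finally, letting $\bar{\mathbf{w}} = \frac{1}{m} \sum_{t=1}^{m} \mathbf{w}_t$ and recalling the convexity of $L_{\mathcal{D}}$, we conclude that

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le \mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} L_{\mathcal{D}}(\mathbf{w}_t) \right] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{d}{m}} .$$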



3.2 Lasso regression 

We now turn to describe our algorithm for the lasso regression problem, in which we would like to
solve the optimization problem

$$\min_{\|\mathbf{w}\|_1 \le B} \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}^T \mathbf{x}_t - y_t)^2 . \tag{11}$$

In this case, we formulate the problem as an online problem over the $L_1$-ball $\mathcal{B}_1(B)$ with the cost
functions (8), and employ Algorithm 3 with an estimation of the gradient $\nabla_t$. This time, we estimate
the gradient (again using 2 attributes of $\mathbf{x}_t$) as follows. We estimate $\mathbf{x}_t$ with $\tilde{\mathbf{x}}_t$, exactly as we did in
the case of ridge regression. However, for estimating $\mathbf{w}_t^T \mathbf{x}_t$, we use a slightly different procedure. We
pick $j_t \in [d]$ according to the distribution $\Pr[j] = |\mathbf{w}_t(j)| / \|\mathbf{w}_t\|_1$, observe the attribute $\mathbf{x}_t(j_t)$ and
define the random variable $\tilde{y}_t = \mathrm{sign}(\mathbf{w}_t(j_t)) \|\mathbf{w}_t\|_1 \mathbf{x}_t(j_t)$. This serves as an unbiased estimator of
$\mathbf{w}_t^T \mathbf{x}_t$, as $\mathbb{E}_t[\tilde{y}_t] = \mathbf{w}_t^T \mathbf{x}_t$. Hence, (10) gives an unbiased estimator for the gradient $\nabla_t$ at iteration $t$.
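The resulting algorithm is given as Algorithm 5. As with Algorithm 4, only a single coordinate
(at index $i_t$) is updated on each iteration. In the following theorem we prove a
convergence guarantee for this algorithm.

Algorithm 5 Attribute-efficient lasso regression

1: Input: training set $S = \{(\mathbf{x}_t, y_t)\}_{t \in [m]}$, regularization parameter $B > 0$, learning rate $\eta > 0$
2: Let $\mathbf{z}_1^+ \leftarrow \mathbf{1}_d$, $\mathbf{z}_1^- \leftarrow \mathbf{1}_d$
3: for $t = 1$ to $m$ do
4:   Let: $\mathbf{w}_t \leftarrow B \cdot \dfrac{\mathbf{z}_t^+ - \mathbf{z}_t^-}{\|\mathbf{z}_t^+\|_1 + \|\mathbf{z}_t^-\|_1}$
5:   Choose $i_t \in [d]$ with probability $1/d$, and observe $\mathbf{x}_t(i_t)$
6:   Choose $j_t \in [d]$ with probability $|\mathbf{w}_t(j)| / \|\mathbf{w}_t\|_1$, and observe $\mathbf{x}_t(j_t)$
7:   Let:
       $\tilde{g}_t \leftarrow [\mathrm{sign}(\mathbf{w}_t(j_t)) \|\mathbf{w}_t\|_1 \mathbf{x}_t(j_t) - y_t] \cdot d \, \mathbf{x}_t(i_t)$
       $\bar{g}_t \leftarrow \mathrm{clip}(\tilde{g}_t, 1/\eta)$
8:   Update:
       $\mathbf{z}_{t+1}^+(i_t) \leftarrow \mathbf{z}_t^+(i_t) \cdot \exp(-\eta \bar{g}_t)$
       $\mathbf{z}_{t+1}^-(i_t) \leftarrow \mathbf{z}_t^-(i_t) \cdot \exp(+\eta \bar{g}_t)$
9: end for
10: return $\bar{\mathbf{w}} = \frac{1}{m} \sum_{t=1}^{m} \mathbf{w}_t$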
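
The dot-product estimator used in steps 6 and 7 of Algorithm 5 can be examined exactly by summing over all index draws. The Python sketch below returns its mean and second moment; the function name is introduced for this example, and the sketch assumes `w` is not the zero vector so that the sampling distribution is well defined.

```python
import numpy as np

def lasso_y_estimate_moments(w, x):
    """Exact mean and second moment of sign(w[j]) * ||w||_1 * x[j], where j is
    drawn with Pr[j] = |w[j]| / ||w||_1. Assumes w is not all zeros."""
    l1 = float(np.abs(w).sum())
    probs = np.abs(w) / l1
    values = np.sign(w) * l1 * x        # value of the estimate for each index j
    mean = float(probs @ values)
    second_moment = float(probs @ values ** 2)
    return mean, second_moment
```

The mean equals the dot product of `w` and `x`, and the second moment is at most $\|\mathbf{w}\|_1^2 \|\mathbf{x}\|_\infty^2$, which is the kind of bound the analysis relies on.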



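Theorem 6. Assume the distribution $\mathcal{D}$ is such that $\|\mathbf{x}\|_\infty \le 1$ and $|y| \le 1$ with probability 1. Let
$\bar{\mathbf{w}}$ be the output of Algorithm 5, when run with $\eta = \frac{1}{2B} \sqrt{\frac{\log 2d}{5dm}}$. Then, for any fixed $\mathbf{w}^* \in \mathbb{R}^d$ with
$\|\mathbf{w}^*\|_1 \le B$ it holds that

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{5 d \log 2d}{m}} .$$

Proof: Let $\tilde{\mathbf{x}}_t, \tilde{y}_t, \tilde{\nabla}_t$ be defined as in the text above. A straightforward calculation shows that
$\|\mathbb{E}_t[\tilde{\nabla}_t]\|_\infty \le 2B$. In addition, we have $\mathbb{E}_t[\tilde{y}_t^2] \le B^2$ and $\|\mathbb{E}_t[\tilde{\mathbf{x}}_t^2]\|_\infty \le d$, which imply that
$\|\mathbb{E}_t[\tilde{\nabla}_t^2]\|_\infty \le 4B^2 d$. Thus, Lemma 4 (with $G = 2B\sqrt{d}$) ensures that

$$\mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}_t^T \mathbf{x}_t - y_t)^2 \right] \le \frac{1}{m} \sum_{t=1}^{m} \frac{1}{2} (\mathbf{w}^{*T} \mathbf{x}_t - y_t)^2 + 2B^2 \sqrt{\frac{5 d \log 2d}{m}} .$$

Taking the expectation with respect to the random choice of the training set, we have

$$\mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} L_{\mathcal{D}}(\mathbf{w}_t) \right] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{5 d \log 2d}{m}} .$$

Finally, letting $\bar{\mathbf{w}} = \frac{1}{m} \sum_{t=1}^{m} \mathbf{w}_t$ and exploiting the convexity of $L_{\mathcal{D}}$, we obtain the theorem.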



3.3 Learning from 2k attributes per example 

If we are allowed to observe more than 2 attributes of each training instance, we can further improve
the bounds of Theorems 5 and 6. Let us demonstrate that in the case of lasso regression; the same
method applies to ridge regression as well. Say we are given a budget of $2k$ observable attributes
per training instance. Then, we may simply use each training instance $k$ times, instead of just once.
Essentially, this has the same effect as training with Algorithm 5 over a set that is $k$ times larger.
Hence, with the notations of Theorem 6 we obtain the bound

$$\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le L_{\mathcal{D}}(\mathbf{w}^*) + 2B^2 \sqrt{\frac{5 (d/k) \log 2d}{m}} .$$

As already discussed earlier, this implies that Algorithm 5 needs essentially the same number of
attributes as a full-information algorithm.
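
To make the trade-off concrete, the following Python sketch computes how many examples suffice for a target accuracy when the excess-loss term has the form $c \cdot B^2 \sqrt{(d/k) \log(2d) / m}$. The helper name and the constant `c` are placeholders introduced for this illustration; `c` is left as a parameter rather than taken from the text.

```python
import math

def examples_needed(d, k, eps, B, c=1.0):
    """Smallest integer m with c * B^2 * sqrt((d/k) * log(2d) / m) <= eps.
    The constant c is a placeholder of this sketch, not a value from the text."""
    return math.ceil((c * B ** 2 / eps) ** 2 * (d / k) * math.log(2 * d))
```

Under this form of the bound, multiplying the per-example budget by $k$ divides the required number of examples by roughly $k$, so the total number of observed attributes stays essentially unchanged.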
A Auxiliary Lemmas

We first prove Lemma 2.

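Proof: Following the analysis of Zinkevich (2003) we first note that, since $\mathbf{w}_{t+1}$ is the projection of
$\mathbf{z}_{t+1}$ on the Euclidean ball of radius $B$,

$$\begin{aligned}
\|\mathbf{w}_{t+1} - \mathbf{w}^*\|_2^2 &\le \|\mathbf{z}_{t+1} - \mathbf{w}^*\|_2^2 \\
&= \|\mathbf{w}_t - \mathbf{w}^* - \eta \tilde{\nabla}_t\|_2^2 \\
&= \|\mathbf{w}_t - \mathbf{w}^*\|_2^2 + \eta^2 \|\tilde{\nabla}_t\|_2^2 - 2\eta \tilde{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)
\end{aligned}$$

and by rearranging, we get

$$\tilde{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*) \le \frac{1}{2\eta} \left( \|\mathbf{w}_t - \mathbf{w}^*\|_2^2 - \|\mathbf{w}_{t+1} - \mathbf{w}^*\|_2^2 + \eta^2 \|\tilde{\nabla}_t\|_2^2 \right) . \tag{12}$$

The convexity of $c_t$ implies

$$c_t(\mathbf{w}_t) - c_t(\mathbf{w}^*) \le \nabla_t^T (\mathbf{w}_t - \mathbf{w}^*) = \mathbb{E}_t[\tilde{\nabla}_t]^T (\mathbf{w}_t - \mathbf{w}^*) = \mathbb{E}_t[\tilde{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)]$$

and by taking the expectation of both sides, we obtain

$$\mathbb{E}[c_t(\mathbf{w}_t) - c_t(\mathbf{w}^*)] \le \mathbb{E}[\tilde{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] . \tag{13}$$

Putting (12) and (13) together and summing over $t$, we have

$$\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}[c_t(\mathbf{w}_t) - c_t(\mathbf{w}^*)] &\le \sum_{t=1}^{T} \mathbb{E}[\tilde{\nabla}_t^T (\mathbf{w}_t - \mathbf{w}^*)] \\
&\le \frac{1}{2\eta} \sum_{t=1}^{T} \mathbb{E}\left[ \|\mathbf{w}_t - \mathbf{w}^*\|_2^2 - \|\mathbf{w}_{t+1} - \mathbf{w}^*\|_2^2 \right] + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}[\|\tilde{\nabla}_t\|_2^2] \\
&\le \frac{1}{2\eta} \|\mathbf{w}_1 - \mathbf{w}^*\|_2^2 + \frac{\eta}{2} G^2 T \\
&\le \frac{1}{2\eta} B^2 + \frac{\eta}{2} G^2 T .
\end{aligned}$$

Setting $\eta = \frac{B}{G\sqrt{T}}$ we conclude that

$$\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\mathbf{w}_t) \right] - \sum_{t=1}^{T} c_t(\mathbf{w}^*) \le BG\sqrt{T} .$$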



The following lemma is a simplified version of Lemma 2.3 from Clarkson et al. (2010).

Lemma 7. Let $\eta > 0$ and let $\mathbf{v}_1, \ldots, \mathbf{v}_T$ be an arbitrary sequence of vectors in $\mathbb{R}^d$ with $\mathbf{v}_t(i) \ge -1/\eta$
for all $i \in [d]$ and $t \in [T]$. Consider the following MW algorithm: set $\mathbf{w}_1 \leftarrow \mathbf{1}_d$ and for $t \ge 1$,

$$\mathbf{w}_{t+1}(i) \leftarrow \mathbf{w}_t(i) \cdot e^{-\eta \mathbf{v}_t(i)} \quad \forall i \in [d] .$$

Then, for the normalized vectors $\mathbf{p}_t = \mathbf{w}_t / \|\mathbf{w}_t\|_1$, we have

$$\sum_{t=1}^{T} \mathbf{p}_t^T \mathbf{v}_t \le \min_{i \in [d]} \sum_{t=1}^{T} \mathbf{v}_t(i) + \frac{\log d}{\eta} + \eta \sum_{t=1}^{T} \mathbf{p}_t^T \mathbf{v}_t^2 .$$

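Proof: Using the fact that $e^z \le 1 + z + z^2$ for $z \le 1$, we have

$$\begin{aligned}
\|\mathbf{w}_{t+1}\|_1 &= \sum_{i \in [d]} \mathbf{w}_t(i) \cdot e^{-\eta \mathbf{v}_t(i)} \\
&\le \sum_{i \in [d]} \mathbf{w}_t(i) \cdot (1 - \eta \mathbf{v}_t(i) + \eta^2 \mathbf{v}_t(i)^2) \\
&= \|\mathbf{w}_t\|_1 \cdot (1 - \eta \mathbf{p}_t^T \mathbf{v}_t + \eta^2 \mathbf{p}_t^T \mathbf{v}_t^2)
\end{aligned}$$

and since $e^z \ge 1 + z$ for $z \in \mathbb{R}$, this implies by induction that

$$\begin{aligned}
\log \|\mathbf{w}_{T+1}\|_1 &\le \log d + \sum_{t \in [T]} \log(1 - \eta \mathbf{p}_t^T \mathbf{v}_t + \eta^2 \mathbf{p}_t^T \mathbf{v}_t^2) \\
&\le \log d - \eta \sum_{t \in [T]} \mathbf{p}_t^T \mathbf{v}_t + \eta^2 \sum_{t \in [T]} \mathbf{p}_t^T \mathbf{v}_t^2 . 
\end{aligned} \tag{14}$$

On the other hand, we have

$$\log \|\mathbf{w}_{T+1}\|_1 = \log \left( \sum_{i \in [d]} \prod_{t \in [T]} e^{-\eta \mathbf{v}_t(i)} \right) \ge \log \max_{i \in [d]} \prod_{t \in [T]} e^{-\eta \mathbf{v}_t(i)} = \max_{i \in [d]} \left( -\eta \sum_{t=1}^{T} \mathbf{v}_t(i) \right) ,$$

that is,

$$\log \|\mathbf{w}_{T+1}\|_1 \ge -\eta \min_{i \in [d]} \sum_{t=1}^{T} \mathbf{v}_t(i) . \tag{15}$$

Combining (14) and (15) and rearranging, we obtain

$$\sum_{t=1}^{T} \mathbf{p}_t^T \mathbf{v}_t \le \min_{i \in [d]} \sum_{t=1}^{T} \mathbf{v}_t(i) + \frac{\log d}{\eta} + \eta \sum_{t=1}^{T} \mathbf{p}_t^T \mathbf{v}_t^2$$

which completes the proof. ■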
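
The inequality of Lemma 7 can also be checked numerically on a concrete sequence. The Python sketch below runs the multiplicative-weights update and returns both sides of the bound; the function name and the array layout (one vector $\mathbf{v}_t$ per row of `V`) are choices made for this example.

```python
import numpy as np

def mw_inequality_sides(V, eta):
    """V has shape (T, d) with every entry >= -1/eta. Returns (lhs, rhs) of
    the multiplicative-weights bound for the iterates generated from V."""
    T, d = V.shape
    w = np.ones(d)
    lhs, quad = 0.0, 0.0
    for v in V:
        p = w / w.sum()              # normalized weights p_t
        lhs += float(p @ v)
        quad += float(p @ v ** 2)    # entrywise square, as in the lemma
        w = w * np.exp(-eta * v)     # multiplicative update
    rhs = float(V.sum(axis=0).min()) + np.log(d) / eta + eta * quad
    return lhs, rhs
```

Any input that respects the lower bound on the entries should satisfy `lhs <= rhs`, up to floating-point rounding.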

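The next lemma allows us to bound expected values of clipped random variables.

Lemma 8. Let $X$ be a random variable, let $\bar{X} = \mathrm{clip}(X, C)$ and assume that $|\mathbb{E}[X]| \le C/2$ for
some $C > 0$. Then

$$|\mathbb{E}[\bar{X}] - \mathbb{E}[X]| \le \frac{2}{C} \mathrm{Var}[X] .$$

Proof: Note that for $x > C$ we have $x - \mathbb{E}[X] \ge C/2$, so that

$$C(x - C) \le 2(x - \mathbb{E}[X])(x - C) \le 2(x - \mathbb{E}[X])^2 .$$

Hence, we obtain

$$\begin{aligned}
\mathbb{E}[X] - \mathbb{E}[\bar{X}] &= \int_{x < -C} (x + C) \, d\mu_X + \int_{x > C} (x - C) \, d\mu_X \\
&\le \int_{x > C} (x - C) \, d\mu_X \\
&\le \frac{2}{C} \int_{x > C} (x - \mathbb{E}[X])^2 \, d\mu_X \\
&\le \frac{2}{C} \mathrm{Var}[X] .
\end{aligned}$$

Similarly one can prove that $\mathbb{E}[X] - \mathbb{E}[\bar{X}] \ge -2\mathrm{Var}[X]/C$, and the lemma follows.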
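
For a discrete random variable, both sides of Lemma 8 can be computed exactly. The Python sketch below does so; the function name and the representation of the distribution as arrays of values and probabilities are assumptions of this example.

```python
import numpy as np

def clipping_bias_and_bound(values, probs, C):
    """For a discrete X taking values[i] with probability probs[i], return
    |E[clip(X, C)] - E[X]| and the bound (2/C) * Var[X]."""
    mean = float(probs @ values)
    var = float(probs @ (values - mean) ** 2)
    clipped_mean = float(probs @ np.clip(values, -C, C))
    return abs(clipped_mean - mean), 2.0 * var / C
```

The bound only applies when the mean is at most $C/2$ in absolute value, so that condition should be checked before comparing the two returned numbers.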